Claude Opus 4
This is the most misunderstood graph in AI
To some, METR's "time horizon plot" indicates that AI utopia--or apocalypse--is close at hand. The truth is more complicated. Every time OpenAI, Google, or Anthropic drops a new frontier large language model, the AI community holds its breath. It doesn't exhale until METR, an AI research nonprofit whose name stands for "Model Evaluation & Threat Research," updates a now-iconic graph that has played a major role in the AI discourse since it was first released in March of last year. The graph suggests that certain AI capabilities are developing at an exponential rate, and more recent model releases have outperformed that already impressive trend. That was certainly the case for Claude Opus 4.5, the latest version of Anthropic's most powerful model, which was released in late November.
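The trend behind the plot is simple to state: the length of tasks (measured by how long they take skilled humans) that models can finish at roughly a 50% success rate appears to double on a fairly regular schedule. The snippet below is a minimal sketch of how such a doubling trend extrapolates; the 7-month doubling time and the 60-minute starting horizon are placeholder assumptions for illustration, not METR's fitted values.

```python
# Minimal illustration of how an exponential time-horizon trend extrapolates.
# ASSUMPTIONS: the 7-month doubling time and the 60-minute starting horizon
# are placeholder numbers, not METR's published fit.

DOUBLING_MONTHS = 7.0   # assumed doubling time of the 50%-success time horizon
H0_MINUTES = 60.0       # assumed horizon at month 0 (in human-minutes)

def horizon_minutes(months_elapsed: float) -> float:
    """Length of task (in human-minutes) completable at ~50% success under the trend."""
    return H0_MINUTES * 2 ** (months_elapsed / DOUBLING_MONTHS)

for m in (0, 7, 14, 28, 42):
    print(f"month {m:>2}: ~{horizon_minutes(m):,.0f} min")
# month  0: ~60 min ... month 42: ~3,840 min (about 64 human-hours)
```

A model release "outperforming the trend" simply means its measured horizon lands above the curve this kind of fit predicts for its release date.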
- North America > United States > Massachusetts (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Asia > China (0.04)
How Claude Code Is Reshaping Software--and Anthropic
WIRED spoke with Boris Cherny, head of Claude Code, about how the viral coding tool is changing the way Anthropic works. Engineers in Silicon Valley have been raving about Anthropic's AI coding tool, Claude Code, for months. But recently, the buzz feels as if it's reached a fever pitch. Earlier this week, I sat down with Boris Cherny, head of Claude Code, to try to understand how the company is meeting this moment. "We built the simplest possible thing," said Cherny. "The craziest thing was learning three months ago that half of the sales team at Anthropic uses Claude Code every week."
- North America > United States > California (0.35)
- Asia > China (0.05)
- Europe > Slovakia (0.04)
- (3 more...)
Why the World's Best AI Systems Are Still So Bad at Pokémon
Pillay is an editorial fellow at TIME. Right now, live on Twitch, you can watch three of the world's smartest AI systems--GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro--doing their best to beat classic Pokémon games. At least by human standards, they are not very good. The systems are slow, overconfident, and often confused.
- North America > United States (0.05)
- Europe > France (0.05)
- Asia > China (0.05)
- Africa (0.05)
Reasoning Models Ace the CFA Exams
Patel, Jaisal, Chen, Yunzhe, He, Kaiwen, Wang, Keyi, Li, David, Xiao, Kairong, Liu, Xiao-Yang
Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria from prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.
- North America > United States > North Carolina > Orange County > Chapel Hill (0.04)
- North America > United States > New York > Rensselaer County > Troy (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > South Korea (0.04)
IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch
Biyani, Param, Kirtania, Shashank, Bajpai, Yasharth, Gulwani, Sumit, Tiwari, Ashish
We introduce IndiMathBench, a human-verified benchmark designed to evaluate mathematical theorem proving, curated using an AI-powered, human-assisted pipeline for formalizing natural language problems in Lean. IndiMathBench is composed of 312 formal Lean 4 theorems paired with their corresponding informal problem statements, sourced from Indian Mathematics Olympiads. Through category-based retrieval, iterative compiler feedback, and multi-model ensembles, our pipeline generates candidate formalizations that experts efficiently validate via an interactive dashboard with automated quality summaries. Evaluation across multiple frontier models shows that autoformalization remains challenging, with substantial gaps between syntactic validity and semantic correctness, and that theorem-proving success rates remain low even with iterative refinement, making IndiMathBench a challenging testbed for mathematical reasoning. IndiMathBench is available at https://github.com/prmbiy/IndiMathBench.
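To make the informal-to-formal pairing concrete, here is a hypothetical example in the style the pipeline targets: an olympiad-flavored statement rendered as a Lean 4 theorem. The statement, names, and the `sorry` placeholder are invented for illustration and are not drawn from the IndiMathBench dataset; supplying the proof is the part evaluated models are asked to do.

```lean
-- Hypothetical informal/formal pair (not from IndiMathBench), for illustration.
-- Informal: "Show that for every natural number n, n^2 + n is even."
import Mathlib

theorem sq_add_self_even (n : ℕ) : Even (n ^ 2 + n) := by
  sorry  -- producing this proof is the theorem-proving task being evaluated
```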
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.68)
ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration
Su, Hongjin, Diao, Shizhe, Lu, Ximing, Liu, Mingjie, Xu, Jiacheng, Dong, Xin, Fu, Yonggan, Belcak, Peter, Ye, Hanrong, Yin, Hongxu, Dong, Yi, Bakhturina, Evelina, Yu, Tao, Choi, Yejin, Kautz, Jan, Molchanov, Pavlo
Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.
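The abstract does not spell out the reward, but a sketch helps fix ideas: one simplified, hedged reading of an "outcome-, efficiency-, and user-preference-aware" reward is a weighted sum of the three terms. The weights, cost normalization, and function name below are assumptions, not the paper's implementation.

```python
# Simplified, assumed reward shaping for an orchestrator policy; illustrative
# only, not ToolOrchestra's actual formulation.

def orchestration_reward(
    correct: bool,             # outcome: did the final answer pass the check?
    cost_usd: float,           # efficiency: total spend across model/tool calls
    preference_match: float,   # in [0, 1]: how well tool choices matched user preferences
    cost_budget_usd: float = 1.0,
    w_outcome: float = 1.0,
    w_efficiency: float = 0.3,
    w_preference: float = 0.2,
) -> float:
    outcome = 1.0 if correct else 0.0
    efficiency = max(0.0, 1.0 - cost_usd / cost_budget_usd)  # cheaper episodes score higher
    return w_outcome * outcome + w_efficiency * efficiency + w_preference * preference_match

# Example: a correct answer at $0.25 with fully preferred tools
# -> 1.0*1 + 0.3*0.75 + 0.2*1.0 = 1.425
```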
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B
Xu, Sen, Zhou, Yi, Wang, Wei, Min, Jixin, Yin, Zhibin, Dai, Yingwei, Liu, Shixi, Pang, Lianyu, Chen, Yirong, Zhang, Junlin
Challenging the prevailing consensus that small models inherently lack robust reasoning, this report introduces VibeThinker-1.5B, a 1.5B-parameter dense model developed via our Spectrum-to-Signal Principle (SSP). It runs counter to the dominant approach of scaling model parameters to enhance capabilities, as seen in models like DeepSeek R1 (671B) and Kimi k2 (>1T). The SSP framework first employs Two-Stage Diversity-Exploring Distillation (SFT) to generate a broad spectrum of solutions, followed by MaxEnt-Guided Policy Optimization (RL) to amplify the correct signal. With a total training cost of only $7,800, VibeThinker-1.5B demonstrates superior reasoning capabilities compared to closed-source models like Magistral Medium and Claude Opus 4, and performs on par with open-source models like GPT OSS-20B Medium. Remarkably, it surpasses the 400x larger DeepSeek R1 on three math benchmarks: AIME24 (80.3 vs. 79.8), AIME25 (74.4 vs. 70.0), and HMMT25 (50.4 vs. 41.7). This is a substantial improvement over its base model (6.7, 4.3, and 0.6, respectively). On LiveCodeBench V6, it scores 51.1, outperforming Magistral Medium's 50.3 and its base model's 0.0. These findings demonstrate that small models can achieve reasoning capabilities comparable to large models, drastically reducing training and inference costs and thereby democratizing advanced AI research.
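The abstract names MaxEnt-Guided Policy Optimization without defining it, so the following is only a generic sketch of one way an entropy criterion can guide RL data selection: prioritize problems where the model's current pass rate sits nearest 50%, where a pass/fail outcome carries maximal entropy. The function names and ranking rule are assumptions, not the SSP recipe.

```python
# Generic entropy-guided problem ranking; an illustrative assumption, not the
# SSP / MaxEnt-Guided Policy Optimization procedure from the report.
import math

def bernoulli_entropy(p: float) -> float:
    """Entropy (nats) of a pass/fail outcome with pass rate p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def rank_problems_by_entropy(pass_rates: dict[str, float]) -> list[str]:
    """Problems with pass rates nearest 0.5 yield the most learning signal per rollout."""
    return sorted(pass_rates, key=lambda pid: bernoulli_entropy(pass_rates[pid]), reverse=True)

print(rank_problems_by_entropy({"p1": 0.05, "p2": 0.50, "p3": 0.90}))
# ['p2', 'p3', 'p1'] -- the 50%-solved problem is prioritized
```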
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence
Sharrock, Callum, Petersson, Lukas, Petersson, Hanna, Backlund, Axel, Wennström, Axel, Nordström, Kristoffer, Aronsson, Elias
We present Butter-Bench, a benchmark evaluating large language model (LLM) controlled robots for practical intelligence, defined as the ability to navigate the messiness of the physical world. Current state-of-the-art robotic systems use a hierarchical architecture with LLMs in charge of high-level reasoning, and a Vision Language Action (VLA) model for low-level control. Butter-Bench evaluates the LLM part in isolation from the VLA. Although LLMs have repeatedly surpassed humans in evaluations requiring analytical intelligence, we find humans still outperform LLMs on Butter-Bench. The best LLMs score 40% on Butter-Bench, while the mean human score is 95%. LLMs struggled the most with multi-step spatial planning and social understanding. We also evaluate LLMs that are fine-tuned for embodied reasoning and conclude that this training does not improve their score on Butter-Bench. Language models (LMs) were initially intended for narrow text understanding tasks. The first Transformer-based LM (Vaswani et al., 2017) was explicitly trained for translation. However, large-scale training runs of LMs eventually resulted in emergent behaviour - model capabilities that were not explicitly trained for (Brown et al., 2020). For example, LLMs are not trained to be robots, yet companies such as Figure (Helix, 2025) and Google DeepMind (Gemini Robotics 1.5, 2025) use LLMs in their robotic stack.
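A rough sketch of the hierarchical split described above may help: an LLM planner proposes high-level steps and a separate low-level controller (a VLA model in deployed systems) executes them, with only the planner under evaluation in Butter-Bench. Both functions below are stubs invented here, not the benchmark's harness or prompts.

```python
# Stub sketch of the LLM-planner / low-level-controller split; invented for
# illustration, not Butter-Bench's actual harness.

def llm_plan(task: str, observations: list[str]) -> list[str]:
    """Stand-in for the high-level reasoner (the layer the benchmark evaluates)."""
    return ["locate the butter", "navigate to it", "grasp it", "deliver it to the table"]

def low_level_execute(subgoal: str) -> str:
    """Stand-in for the VLA controller; returns an observation for the planner."""
    return f"completed: {subgoal}"

def run_episode(task: str) -> list[str]:
    observations: list[str] = []
    for subgoal in llm_plan(task, observations):
        observations.append(low_level_execute(subgoal))
    return observations

print(run_episode("bring the butter to the table"))
```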
- Leisure & Entertainment (0.93)
- Information Technology (0.68)
AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?
Press, Ori, Amos, Brandon, Zhao, Haoyu, Wu, Yikai, Ainsworth, Samuel K., Krupke, Dominik, Kidger, Patrick, Sajed, Touqir, Stellato, Bartolomeo, Park, Jisun, Bosch, Nathanael, Meril, Eli, Steppi, Albert, Zharmagambetov, Arman, Zhang, Fangzhao, Perez-Pineiro, David, Mercurio, Alberto, Zhan, Ni, Abramovich, Talor, Lieret, Kilian, Zhang, Hanlin, Huang, Shirley, Bethge, Matthias, Press, Ofir
Despite progress in language model (LM) capabilities, evaluations have thus far focused on models' performance on tasks that humans have previously solved, including in programming (Jimenez et al., 2024) and mathematics (Glazer et al., 2024). We therefore propose testing models' ability to design and implement algorithms in an open-ended benchmark: We task LMs with writing code that efficiently solves computationally challenging problems in computer science, physics, and mathematics. Our AlgoTune benchmark consists of 154 coding tasks collected from domain experts and a framework for validating and timing LM-synthesized solution code, which is compared to reference implementations from popular open-source packages. In addition, we develop a baseline LM agent, AlgoTuner, and evaluate its performance across a suite of frontier models. AlgoTuner uses a simple, budgeted loop that edits code, compiles and runs it, profiles performance, verifies correctness on tests, and selects the fastest valid version. AlgoTuner achieves an average 1.72x speedup against our reference solvers, which use libraries such as SciPy, scikit-learn, and CVXPY. However, we find that current models fail to discover algorithmic innovations, instead preferring surface-level optimizations. We hope that AlgoTune catalyzes the development of LM agents exhibiting creative problem solving beyond state-of-the-art human performance.
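The abstract describes AlgoTuner's loop explicitly enough to sketch: propose an edit, reject it if it fails the correctness checks, time it, and keep the fastest valid version within a fixed budget. The helper callables and the budget below are hypothetical stand-ins, not AlgoTuner's real interfaces.

```python
# Budgeted edit-profile-verify loop in the spirit of the description above.
# propose_edit / passes_tests / run_time are hypothetical stand-ins.

def tune(propose_edit, passes_tests, run_time, initial_code: str, budget: int = 20):
    """Keep the fastest candidate that still passes the task's correctness checks."""
    best_code, best_time = initial_code, run_time(initial_code)
    code = initial_code
    for _ in range(budget):
        code = propose_edit(code)     # e.g. ask an LM for a revised, hopefully faster version
        if not passes_tests(code):    # discard edits that break correctness
            continue
        elapsed = run_time(code)      # profile the candidate on the task's inputs
        if elapsed < best_time:
            best_code, best_time = code, elapsed
    return best_code, best_time
```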
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
- North America > United States > Indiana > Marion County > Indianapolis (0.04)
- North America > Canada (0.04)
- (6 more...)
PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits
Bhuiya, Neeladri, Aggarwal, Madhav, Purwar, Diptanshu
Large Language Models (LLMs) are improving at an exceptional rate. With the advent of agentic workflows, multi-turn dialogue has become the de facto mode of interaction with LLMs for completing long and complex tasks. While LLM capabilities continue to improve, they remain increasingly susceptible to jailbreaking, especially in multi-turn scenarios where harmful intent can be subtly injected across the conversation to produce nefarious outcomes. While single-turn attacks have been extensively explored, adaptability, efficiency, and effectiveness remain key challenges for their multi-turn counterparts. To address these gaps, we present PLAGUE, a novel plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents. PLAGUE dissects the lifetime of a multi-turn attack into three carefully designed phases (Primer, Planner, and Finisher) that enable a systematic and information-rich exploration of the multi-turn attack family. Evaluations show that red-teaming agents designed using PLAGUE achieve state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models within a smaller or comparable query budget. In particular, PLAGUE enables an ASR (based on StrongReject) of 81.4% on OpenAI's o3 and 67.3% on Anthropic's Claude Opus 4.1, two models that are considered highly resistant to jailbreaks in the safety literature. Our work offers tools and insights into the importance of plan initialization, context optimization, and lifelong learning in crafting multi-turn attacks for comprehensive model vulnerability evaluation.
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- Europe > Ukraine > Crimea (0.04)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- Workflow (1.00)
- Instructional Material (0.86)
- Research Report > New Finding (0.46)
- Personal > Interview (0.46)